{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7ff1e403",
   "metadata": {},
   "source": [
    "在以英语为代表的印欧语系中，大部分语言都使用空格字符来切分词。因此分词的一种非常简单的方式就是基于空格进行分词：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5d203093",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "输入语句：I learn natural language processing with dongshouxueNLP, too.\n",
      "分词结果：['I', 'learn', 'natural', 'language', 'processing', 'with', 'dongshouxueNLP,', 'too.']\n"
     ]
    }
   ],
   "source": [
    "sentence = \"I learn natural language processing with dongshouxueNLP, too.\"\n",
    "tokens = sentence.split(' ')\n",
    "print(f'输入语句：{sentence}')\n",
    "print(f\"分词结果：{tokens}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bc0fa24",
   "metadata": {},
   "source": [
    "从上面的代码可以看到，最简单的基于空格的分词方法无法将词与词后面的标点符号分割。如果标点符号对于后续任务（例如文本分类）并不重要，可以去除这些标点符号后再进一步分词：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "fe2f9e9d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "输入语句：I learn natural language processing with dongshouxueNLP, too.\n",
      "分词结果：['I', 'learn', 'natural', 'language', 'processing', 'with', 'dongshouxueNLP', 'too']\n"
     ]
    }
   ],
   "source": [
    "#引入正则表达式包\n",
    "import re\n",
    "sentence = \"I learn natural language processing with dongshouxueNLP, too.\"\n",
    "print(f'输入语句：{sentence}')\n",
    "\n",
    "#去除句子中的“,”和“.”\n",
    "sentence = re.sub(r'\\,|\\.','',sentence)\n",
    "tokens = sentence.split(' ')\n",
    "print(f\"分词结果：{tokens}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a1fb4d0",
   "metadata": {},
   "source": [
    "正则表达式使用单个字符串（通常称为“模式”即pattern）来描述、匹配对应文本中全部匹配某个指定规则的字符串。\n",
    "我们也可以使用正则表达式来实现空格分词：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "391cf386",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Did', 'you', 'spend', '3', '4', 'on', 'arxiv', 'org', 'for', 'your', 'pre', 'print', 'No', 'it', 's', 'free', 'It', 's']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "sentence = \"Did you spend $3.4 on arxiv.org for your pre-print?\"+\\\n",
    "    \" No, it's free! It's ...\"\n",
    "# 其中，\\w表示匹配a-z，A-Z，0-9和“_”这4种类型的字符，等价于[a-zA-Z0-9_]，\n",
    "# +表示匹配前面的表达式1次或者多次。因此\\w+表示匹配上述4种类型的字符1次或多次。\n",
    "pattern = r\"\\w+\"\n",
    "print(re.findall(pattern, sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28ffa18a",
   "metadata": {},
   "source": [
    "处理标点："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5c616bf5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv', '.org', 'for', 'your', 'pre', '-print', '?', 'No', ',', 'it', \"'s\", 'free', '!', 'It', \"'s\", '.', '.', '.']\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# 可以在正则表达式中使用\\S来表示除了空格以外的所有字符（\\s在正则表达式中表示空格字符，\\S则相应的表示\\s的补集）\n",
    "# |表示或运算，*表示匹配前面的表达式0次或多次，\\S\\w* 表示先匹配除了空格以外的1个字符，后面可以包含0个或多个\\w字符。\n",
    "pattern = r\"\\w+|\\S\\w*\"\n",
    "print(re.findall(pattern, sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28f7cbab",
   "metadata": {},
   "source": [
    "处理连字符："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1a678ed0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Did', 'you', 'spend', '3', '4', 'on', 'arxiv', 'org', 'for', 'your', 'pre-print', 'No', \"it's\", 'free', \"It's\"]\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# -表示匹配连字符-，(?:[-']\\w+)*表示匹配0次或多次括号内的模式。(?:...)表示匹配括号内的模式，\n",
    "# 可以和+/*等符号连用。其中?:表示不保存匹配到的括号中的内容，是re代码库中的特殊标准要求的部分。\n",
    "pattern = r\"\\w+(?:[-']\\w+)*\"\n",
    "print(re.findall(pattern, sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af2bfcce",
   "metadata": {},
   "source": [
    "将前面的匹配符号的模式\\S\\w\\*组合起来，可以得到一个既可以处理标点符号又可以处理连字符的正则表达式："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d4a45574",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv', '.org', 'for', 'your', 'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '.', '.', '.']\n"
     ]
    }
   ],
   "source": [
    "pattern = r\"\\w+(?:[-']\\w+)*|\\S\\w*\"\n",
    "print(re.findall(pattern, sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a4b2e70",
   "metadata": {},
   "source": [
    "\n",
    "在英文简写和网址中，常常会使用'.'，它与英文中的句号为同一个符号，匹配这种情况的正则表达式为：\n",
    "\n",
    "* 正则表达式模式：(\\w+\\\\.)+\\w+(\\\\.)*\n",
    "* 符合匹配的字符串示例：\n",
    "    * U.S.A.、arxiv.org\n",
    "* 不符合的字符串示例：\n",
    "    * <span style=\"text-decoration:underline\">$</span>3<span style=\"text-decoration:underline\">.</span>4、..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "06772a7b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv.org', 'for', 'your', 'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '.', '.', '.']\n"
     ]
    }
   ],
   "source": [
    "#新的匹配模式\n",
    "new_pattern = r\"(?:\\w+\\.)+\\w+(?:\\.)*\"\n",
    "pattern = new_pattern +r\"|\"+pattern\n",
    "print(re.findall(pattern, sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8eabc8e5",
   "metadata": {},
   "source": [
    "需要注意的是，字符“.”在正则表达式中表示匹配任意字符，因此要表示字符本身的含义时，需要在该符号前面加入转义字符（Escape Character）\"\\\\\"，即“\\\\.”。同理，想要表示“+”“？”“(”“)”“$”这些特殊字符时，需要在前面加入转义字符“\\\\”。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1808683",
   "metadata": {},
   "source": [
    "在许多语言中，货币和百分比符号与数字是直接相连的，匹配这种情况的正则表达式为：\n",
    "\n",
    "* 正则表达式模式：\\\\$?\\d+(\\\\.\\d+)?%?\n",
    "* 符合匹配的字符串示例：\n",
    "    * \\$3.40、3.5%\n",
    "* 不符合的字符串示例：\n",
    "    * \\$.4、1.4.0、1\\%\\%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1bd6d2c5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your', 'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '.', '.', '.']\n"
     ]
    }
   ],
   "source": [
    "#新的匹配pattern，匹配价格符号\n",
    "new_pattern2 = r\"\\$?\\d+(?:\\.\\d+)?%?\"\n",
    "pattern = new_pattern2 +r\"|\" + new_pattern +r\"|\"+pattern\n",
    "print(re.findall(pattern, sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4971854",
   "metadata": {},
   "source": [
    "其中\\d表示所有的数字字符，?表示匹配前面的模式0次或者1次。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4dde345",
   "metadata": {},
   "source": [
    "省略号本身表达了一定的含义，因此要在分词中将其保留，匹配它的正则表达式为：\n",
    "\n",
    "* 正则表达式模式：$\\text{\\\\}$.$\\text{\\\\}$.$\\text{\\\\}$.\n",
    "* 符合匹配的字符串示例：\n",
    "    * ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "1ad794a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your', 'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '...']\n"
     ]
    }
   ],
   "source": [
    "#新的匹配pattern，匹配价格符号\n",
    "new_pattern3 = r\"\\.\\.\\.\" \n",
    "pattern = new_pattern3 +r\"|\" + new_pattern2 +r\"|\" +\\\n",
    "    new_pattern +r\"|\"+pattern\n",
    "print(re.findall(pattern, sentence))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7b5ec308",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting nltk\n",
      "  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)\n",
      "Collecting click (from nltk)\n",
      "  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)\n",
      "Collecting joblib (from nltk)\n",
      "  Downloading joblib-1.5.0-py3-none-any.whl.metadata (5.6 kB)\n",
      "Collecting regex>=2021.8.3 (from nltk)\n",
      "  Downloading regex-2024.11.6-cp312-cp312-win_amd64.whl.metadata (41 kB)\n",
      "Collecting tqdm (from nltk)\n",
      "  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)\n",
      "Requirement already satisfied: colorama in c:\\users\\孔子\\desktop\\社会网络舆情\\@hands-on-nlp-main\\@hands-on-nlp-main\\.venv\\lib\\site-packages (from click->nltk) (0.4.6)\n",
      "Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)\n",
      "   ---------------------------------------- 0.0/1.5 MB ? eta -:--:--\n",
      "   ------ --------------------------------- 0.3/1.5 MB ? eta -:--:--\n",
      "   ---------------------------------------- 1.5/1.5 MB 5.7 MB/s eta 0:00:00\n",
      "Downloading regex-2024.11.6-cp312-cp312-win_amd64.whl (273 kB)\n",
      "Downloading click-8.1.8-py3-none-any.whl (98 kB)\n",
      "Downloading joblib-1.5.0-py3-none-any.whl (307 kB)\n",
      "Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)\n",
      "Installing collected packages: tqdm, regex, joblib, click, nltk\n",
      "\n",
      "   -------- ------------------------------- 1/5 [regex]\n",
      "   ---------------- ----------------------- 2/5 [joblib]\n",
      "   ---------------- ----------------------- 2/5 [joblib]\n",
      "   ---------------- ----------------------- 2/5 [joblib]\n",
      "   ------------------------ --------------- 3/5 [click]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   -------------------------------- ------- 4/5 [nltk]\n",
      "   ---------------------------------------- 5/5 [nltk]\n",
      "\n",
      "Successfully installed click-8.1.8 joblib-1.5.0 nltk-3.9.1 regex-2024.11.6 tqdm-4.67.1\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "pip install nltk"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2598928d",
   "metadata": {},
   "source": [
    "<!-- #### 5. 基于NLTK工具包的分词 -->\n",
    "NLTK {cite:p}`bird2009natural` 是基于Python的NLP工具包，也可以用于实现前面提到的基于正则表达式的分词。\n",
    "\n",
    "<!-- 它包含了50个语料库和词汇库（如WordNet），并提供易于使用的接口。NLTK中包含了分类（Classification）、词性标注（Part-of-Speech Tagging，POS Tagging）、词干还原（Stemming）、序列标注（Sequence Labeling）、句法（Syntactic Parsing）和语义分析（Semantic Parsing）等文本处理工具。NLTK 是一个免费、开源、社区驱动的项目，适用于Windows、Mac OS X和Linux等各种平台，适用于各种用户人群。\n",
    " -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8ec8b313",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your', 'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '...']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "import nltk\n",
    "#引入NLTK分词器\n",
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.tokenize import regexp_tokenize\n",
    "\n",
    "tokens = regexp_tokenize(sentence,pattern)\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b1a7442",
   "metadata": {},
   "source": [
    "基于BPE的词元学习器。\n",
    "\n",
    "给定一个词表包含所有的字符（如，{A, B, C, D, ..., a, b, c, d, ...}），词元学习器重复以下步骤来构建词表：\n",
    "\n",
    "（1）找出在训练语料中最常相连的两个符号，这里称其为“$C_1$”和“$C_2$”；\n",
    "\n",
    "（2）将新组合的符号“$C_1$$C_2$”加入词表当中；\n",
    "\n",
    "（3）将训练语料中所有相连的“$C_1$”和“$C_2$”转换成“$C_1$$C_2$”；\n",
    "\n",
    "（4）重复上述步骤$k$次。\n",
    "\n",
    "假设有一个训练语料包含了一些方向和中国的地名的拼音：\n",
    "```\n",
    "nan nan nan nan nan nanjing nanjing beijing beijing beijing beijing beijing beijing dongbei dongbei dongbei bei bei\n",
    "```\n",
    "首先，我们基于空格将语料分解成词元，然后加入特殊符号“_”来作为词尾的标识符，通过这种方式可以更好地去包含相似子串的词语（例如区分al在form<span style=\"text-decoration:underline\">al</span>和<span style=\"text-decoration:underline\">al</span>most中的区别）。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2617fbb2",
   "metadata": {},
   "source": [
    "第一步，根据语料构建初始的词表："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "bc7f60a9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "语料：\n",
      "5 ['n', 'a', 'n', '_']\n",
      "2 ['n', 'a', 'n', 'j', 'i', 'n', 'g', '_']\n",
      "6 ['b', 'e', 'i', 'j', 'i', 'n', 'g', '_']\n",
      "3 ['d', 'o', 'n', 'g', 'b', 'e', 'i', '_']\n",
      "2 ['b', 'e', 'i', '_']\n",
      "词表：['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o']\n"
     ]
    }
   ],
   "source": [
    "corpus = \"nan nan nan nan nan nanjing nanjing beijing beijing \"+\\\n",
    "    \"beijing beijing beijing beijing dongbei dongbei dongbei bei bei\"\n",
    "tokens = corpus.split(' ')\n",
    "\n",
    "#构建基于字符的初始词表\n",
    "vocabulary = set(corpus) \n",
    "vocabulary.remove(' ')\n",
    "vocabulary.add('_')\n",
    "vocabulary = sorted(list(vocabulary))\n",
    "\n",
    "#根据语料构建词表\n",
    "corpus_dict = {}\n",
    "for token in tokens:\n",
    "    key = token+'_'\n",
    "    if key not in corpus_dict:\n",
    "        corpus_dict[key] = {\"split\": list(key), \"count\": 0}\n",
    "    corpus_dict[key]['count'] += 1\n",
    "\n",
    "print(f\"语料：\")\n",
    "for key in corpus_dict:\n",
    "    print(corpus_dict[key]['count'], corpus_dict[key]['split'])\n",
    "print(f\"词表：{vocabulary}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa28c018",
   "metadata": {},
   "source": [
    "第二步，词元学习器通过迭代的方式逐步组合新的符号加入到词表中："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "1409448a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1次迭代\n",
      "当前最常出现的前5个符号组合：[('ng', 11), ('be', 11), ('ei', 11), ('ji', 8), ('in', 8)]\n",
      "本次迭代组合的符号为：ng\n",
      "\n",
      "迭代后的语料为：\n",
      "5 ['n', 'a', 'n', '_']\n",
      "2 ['n', 'a', 'n', 'j', 'i', 'ng', '_']\n",
      "6 ['b', 'e', 'i', 'j', 'i', 'ng', '_']\n",
      "3 ['d', 'o', 'ng', 'b', 'e', 'i', '_']\n",
      "2 ['b', 'e', 'i', '_']\n",
      "词表：['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o', 'ng']\n",
      "\n",
      "-------------------------------------\n",
      "第2次迭代\n",
      "当前最常出现的前5个符号组合：[('be', 11), ('ei', 11), ('ji', 8), ('ing', 8), ('ng_', 8)]\n",
      "本次迭代组合的符号为：be\n",
      "\n",
      "迭代后的语料为：\n",
      "5 ['n', 'a', 'n', '_']\n",
      "2 ['n', 'a', 'n', 'j', 'i', 'ng', '_']\n",
      "6 ['be', 'i', 'j', 'i', 'ng', '_']\n",
      "3 ['d', 'o', 'ng', 'be', 'i', '_']\n",
      "2 ['be', 'i', '_']\n",
      "词表：['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o', 'ng', 'be']\n",
      "\n",
      "-------------------------------------\n",
      "第3次迭代\n",
      "当前最常出现的前5个符号组合：[('bei', 11), ('ji', 8), ('ing', 8), ('ng_', 8), ('na', 7)]\n",
      "本次迭代组合的符号为：bei\n",
      "\n",
      "迭代后的语料为：\n",
      "5 ['n', 'a', 'n', '_']\n",
      "2 ['n', 'a', 'n', 'j', 'i', 'ng', '_']\n",
      "6 ['bei', 'j', 'i', 'ng', '_']\n",
      "3 ['d', 'o', 'ng', 'bei', '_']\n",
      "2 ['bei', '_']\n",
      "词表：['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o', 'ng', 'be', 'bei']\n",
      "\n",
      "-------------------------------------\n",
      "第9次迭代\n",
      "当前最常出现的前5个符号组合：[('beijing_', 6), ('nan_', 5), ('bei_', 5), ('do', 3), ('ong', 3)]\n",
      "本次迭代组合的符号为：beijing_\n",
      "\n",
      "迭代后的语料为：\n",
      "5 ['nan', '_']\n",
      "2 ['nan', 'jing_']\n",
      "6 ['beijing_']\n",
      "3 ['d', 'o', 'ng', 'bei', '_']\n",
      "2 ['bei', '_']\n",
      "词表：['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o', 'ng', 'be', 'bei', 'ji', 'jing', 'jing_', 'na', 'nan', 'beijing_']\n",
      "\n",
      "-------------------------------------\n"
     ]
    }
   ],
   "source": [
    "for step in range(9):\n",
    "    # 如果想要将每一步的结果都输出，请读者自行将max_print_step改成999\n",
    "    max_print_step = 3\n",
    "    if step < max_print_step or step == 8: \n",
    "        print(f\"第{step+1}次迭代\")\n",
    "    split_dict = {}\n",
    "    for key in corpus_dict:\n",
    "        splits = corpus_dict[key]['split']\n",
    "        # 遍历所有符号进行统计\n",
    "        for i in range(len(splits)-1):\n",
    "            # 组合两个符号作为新的符号\n",
    "            current_group = splits[i]+splits[i+1]\n",
    "            if current_group not in split_dict:\n",
    "                split_dict[current_group] = 0\n",
    "            split_dict[current_group] += corpus_dict[key]['count']\n",
    "\n",
    "    group_hist=[(k, v) for k, v in sorted(split_dict.items(), \\\n",
    "        key=lambda item: item[1],reverse=True)]\n",
    "    if step < max_print_step or step == 8:\n",
    "        print(f\"当前最常出现的前5个符号组合：{group_hist[:5]}\")\n",
    "    \n",
    "    merge_key = group_hist[0][0]\n",
    "    if step < max_print_step or step == 8:\n",
    "        print(f\"本次迭代组合的符号为：{merge_key}\")\n",
    "    for key in corpus_dict:\n",
    "        if merge_key in key:\n",
    "            new_splits = []\n",
    "            splits = corpus_dict[key]['split']\n",
    "            i = 0\n",
    "            while i < len(splits):\n",
    "                if i+1>=len(splits):\n",
    "                    new_splits.append(splits[i])\n",
    "                    i+=1\n",
    "                    continue\n",
    "                if merge_key == splits[i]+splits[i+1]:\n",
    "                    new_splits.append(merge_key)\n",
    "                    i+=2\n",
    "                else:\n",
    "                    new_splits.append(splits[i])\n",
    "                    i+=1\n",
    "            corpus_dict[key]['split']=new_splits\n",
    "            \n",
    "    vocabulary.append(merge_key)\n",
    "    if step < max_print_step or step == 8:\n",
    "        print()\n",
    "        print(f\"迭代后的语料为：\")\n",
    "        for key in corpus_dict:\n",
    "            print(corpus_dict[key]['count'], corpus_dict[key]['split'])\n",
    "        print(f\"词表：{vocabulary}\")\n",
    "        print()\n",
    "        print('-------------------------------------')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b41e133d",
   "metadata": {},
   "source": [
    "得到学习到的词表之后，给定一句新的句子，使用BPE词元分词器根据词表中每个符号学到的顺序，贪心地将字符组合起来。例如输入是“nanjing beijing”，那么根据上面例子里的词表，会先把“n”和“g”组合成“ng”，然后组合“be”“bei”……最终分词成："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "9a938e54",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "输入语句：nanjing beijing\n",
      "分词结果：['nan', 'jing_', 'beijing_']\n"
     ]
    }
   ],
   "source": [
    "ordered_vocabulary = {key: x for x, key in enumerate(vocabulary)}\n",
    "sentence = \"nanjing beijing\"\n",
    "print(f\"输入语句：{sentence}\")\n",
    "tokens = sentence.split(' ')\n",
    "tokenized_string = []\n",
    "for token in tokens:\n",
    "    key = token+'_'\n",
    "    splits = list(key)\n",
    "    #用于在没有更新的时候跳出\n",
    "    flag = 1\n",
    "    while flag:\n",
    "        flag = 0\n",
    "        split_dict = {}\n",
    "        #遍历所有符号进行统计\n",
    "        for i in range(len(splits)-1): \n",
    "            #组合两个符号作为新的符号\n",
    "            current_group = splits[i]+splits[i+1] \n",
    "            if current_group not in ordered_vocabulary:\n",
    "                continue\n",
    "            if current_group not in split_dict:\n",
    "                #判断当前组合是否在词表里，如果是的话加入split_dict\n",
    "                split_dict[current_group] = ordered_vocabulary[current_group] \n",
    "                flag = 1\n",
    "        if not flag:\n",
    "            continue\n",
    "            \n",
    "        #对每个组合进行优先级的排序（此处为从小到大）\n",
    "        group_hist=[(k, v) for k, v in sorted(split_dict.items(),\\\n",
    "            key=lambda item: item[1])] \n",
    "        #优先级最高的组合\n",
    "        merge_key = group_hist[0][0] \n",
    "        new_splits = []\n",
    "        i = 0\n",
    "        # 根据优先级最高的组合产生新的分词\n",
    "        while i < len(splits):\n",
    "            if i+1>=len(splits):\n",
    "                new_splits.append(splits[i])\n",
    "                i+=1\n",
    "                continue\n",
    "            if merge_key == splits[i]+splits[i+1]:\n",
    "                new_splits.append(merge_key)\n",
    "                i+=2\n",
    "            else:\n",
    "                new_splits.append(splits[i])\n",
    "                i+=1\n",
    "        splits=new_splits\n",
    "    tokenized_string+=splits\n",
    "\n",
    "print(f\"分词结果：{tokenized_string}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "432defd3",
   "metadata": {},
   "source": [
    "\n",
    "大小写折叠（case folding）是将所有的英文大写字母转化成小写字母的过程。在搜索场景中，用户往往喜欢使用小写，而在计算机中，大写字母和小写字母并非同一字符，当遇到用户想要搜索一些人名、地名等带有大写字母的专有名词的情况下，正确的搜索结果可能会比较难匹配上。\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "ff4810fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "let's study hands-on-nlp\n"
     ]
    }
   ],
   "source": [
    "# Case Folding\n",
    "sentence = \"Let's study Hands-on-NLP\"\n",
    "print(sentence.lower())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ad4136b",
   "metadata": {},
   "source": [
    "\n",
    "在诸如英文这样的语言中，很多单词都会根据不同的主语、语境、时态等情形修改形态，而这些单词本身表达的含义是接近甚至是相同的。例如英文中的am、is、are都可以还原成be，英文名词cat根据不同情形有cat、cats、cat's、cats'等多种形态。这些形态对文本的语义影响相对较小，但是大幅度提高了词表的大小，因而提高了自然语言处理模型的构建成本。因此在有些文本处理问题上，会将所有的词进行词目还原（lemmatization），即找出词的原型。人类在学习这些语言的过程中，可以通过词典找词的原型；类似地，计算机可以通过建立词典来进行词目还原：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "cc1f7c18",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "词目还原前：['Two', 'dogs', 'are', 'chasing', 'three', 'cats']\n",
      "词目还原后：['Two', 'dog', 'be', 'chase', 'three', 'cat']\n"
     ]
    }
   ],
   "source": [
    "# 构建词典\n",
    "lemma_dict = {'am': 'be','is': 'be','are': 'be','cats': 'cat',\\\n",
    "    \"cats'\": 'cat',\"cat's\": 'cat','dogs': 'dog',\"dogs'\": 'dog',\\\n",
    "    \"dog's\": 'dog', 'chasing': \"chase\"}\n",
    "\n",
    "sentence = \"Two dogs are chasing three cats\"\n",
    "words = sentence.split(' ')\n",
    "print(f'词目还原前：{words}')\n",
    "lemmatized_words = []\n",
    "for word in words:\n",
    "    if word in lemma_dict:\n",
    "        lemmatized_words.append(lemma_dict[word])\n",
    "    else:\n",
    "        lemmatized_words.append(word)\n",
    "\n",
    "print(f'词目还原后：{lemmatized_words}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b6d329d",
   "metadata": {},
   "source": [
    "另外，也可以利用NLTK自带的词典来进行词目还原："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "7bf7cbd4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "词目还原前：['Two', 'dogs', 'are', 'chasing', 'three', 'cats']\n",
      "词目还原后：['Two', 'dog', 'be', 'chase', 'three', 'cat']\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "#引入nltk分词器、lemmatizer，引入wordnet还原动词\n",
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "from nltk.corpus import wordnet\n",
    "\n",
    "#下载分词包、wordnet包\n",
    "nltk.download('punkt', quiet=True)\n",
    "nltk.download('wordnet', quiet=True)\n",
    "\n",
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "sentence = \"Two dogs are chasing three cats\"\n",
    "\n",
    "words = word_tokenize(sentence)\n",
    "print(f'词目还原前：{words}')\n",
    "lemmatized_words = []\n",
    "for word in words:\n",
    "    lemmatized_words.append(lemmatizer.lemmatize(word, wordnet.VERB))\n",
    "\n",
    "print(f'词目还原后：{lemmatized_words}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d74d3eea",
   "metadata": {},
   "source": [
    "\n",
    "很多实际场景中，我们往往需要处理很长的文本，例如新闻、财报、日志等。让计算机直接同时处理整个文本会非常的困难，因此需要将文本分成许多句子来让计算机分别进行处理。对于分句问题，最常见的方法是根据标点符号来分割文本，例如“！”“？”“。”等符号。然而，在某些语言当中，个别分句符号会有歧义。例如英文中的句号“.”也同时有省略符（例如“Inc.”、“Ph.D.”、“Mr.”等）、小数点（例如“3.5”、“.3%”）等含义。这些歧义会导致分句困难。为了解决这种问题，常见的方案是先进行分词，使用基于正则表达式或者基于机器学习的分词方法将文本分解成词元，随后基于符号判断句子边界。例如："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "4071468e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "分句结果：\n",
      "['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your', 'pre-print', '?']\n",
      "['No', ',', \"it's\", 'free', '!']\n",
      "[\"It's\", '...']\n"
     ]
    }
   ],
   "source": [
    "sentence_spliter = set([\".\",\"?\",'!','...'])\n",
    "sentence = \"Did you spend $3.4 on arxiv.org for your pre-print? \" + \\\n",
    "    \"No, it's free! It's ...\"\n",
    "\n",
    "tokens = regexp_tokenize(sentence,pattern)\n",
    "\n",
    "sentences = []\n",
    "boundary = [0]\n",
    "for token_id, token in enumerate(tokens):\n",
    "    # 判断句子边界\n",
    "    if token in sentence_spliter:\n",
    "        #如果是句子边界，则把分句结果加入进去\n",
    "        sentences.append(tokens[boundary[-1]:token_id+1]) \n",
    "        #将下一句句子起始位置加入boundary\n",
    "        boundary.append(token_id+1) \n",
    "\n",
    "if boundary[-1]!=len(tokens):\n",
    "    sentences.append(tokens[boundary[-1]:])\n",
    "\n",
    "print(f\"分句结果：\")\n",
    "for seg_sentence in sentences:\n",
    "    print(seg_sentence)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56f5a604",
   "metadata": {},
   "source": [
    "<!-- ## 参考文献\n",
    "\n",
    "```{bibliography}\n",
    ":style: plain\n",
    ":filter: docname in docnames\n",
    "``` -->"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
